Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move Handshake and Aux Chain out of Opgroups #133

Open
wants to merge 14 commits into
base: develop
Choose a base branch
from

Conversation

Lynx005F
Copy link

@Lynx005F Lynx005F commented Jul 4, 2024

PR to keep in sync with pulp-platform#10

Summary

Introduces new modules that hold the handshake and aux data and use reg-enable to pass data through individual lanes.
This has the following benefits:

  • Removes "play" in between lanes so they can no longer get out of sync due to single event upsets in handshake (original reason for implementing it)
  • Fixes bug where division lanes could get out of sync and output result twice when different lanes take different amount of cycles.
  • Fixes bug where if the unit is stalled and a lower precision FP op follows a higher precision one the lower precision one can "lose" the upper lane results.

In the future this change could also allow operations where Lane 0 is not used e.g. non-vectorial lower precision formats could be calculated on low-precision only units.

Testing

So far I ran the following tests

  • This Version:
    • Mixed Opgroup Test
    • Mixed Vector / Non-vector and Opgroup Test
    • Tests specifically targeting TH32, THMULTI and PULP dividers
  • Pulp Version
    • All of the above
    • Standalone Synthesis to check that Area & Critical Path does not change significantly
    • Full System Synthesis to check that this does not introduce any kind of external Loop

If you want any of my testbenches, please contact Luca Bertaccini

Structure

The commits in this PR have the following structure:

  1. Revert any changes regarding the external reg_enable
  2. Change to aux modules
  3. Re-implement external reg_enable

Notes:

  • This implements the FSM restart on external reg enable for PULPDIVSQRT and THMULTI but not for TH32 - this matches the previous implementation, but it might make more sense to support it everywhere.
  • In fpnew_opgroup_multifmt_slice.sv the signals conv_target_q, result_vec_op and result_is_cpk are not used, meaning the section L490 - L519 (and a few other lines) are superfluous. I modified them as if they were used.

@lucabertaccini
Copy link
Collaborator

LGTM!
In the current version, the opgroup_ready is driven by lane_in_ready[0] (https://github.com/openhwgroup/cvfpu/blob/develop/src/fpnew_opgroup_multifmt_slice.sv#L122), while the opgroup_valid is forwarded to all the units. This can create issues in a SIMD DivSqrt implementation when interleaving operations that do not always use the same number of lanes (scalar-SIMD or SIMD_fmt1-SIMD_fmt2) as the operation is multicycle, lane_in_ready[0] could be 0 while lane_in_ready[n>0] could be 1. This means lane1 sees a valid input handshake, while the (higher hierarchy level) opgroup does not see a valid handshake. Therefore, the operation at the opgroup level will stay valid and eventually be recomputed on the higher lanes when lane0_ready gets to 1.

@michael-platzer
Copy link
Contributor

@Lynx005F thanks for the PR, I would also like to offer some feedback.

In general, I am totally in favor of a cleaner separation between control flow and data flow (e.g., moving handshake, tag, and aux signals away from the actual computations). That being said, it seems to me that moving the control logic out of the lanes but keeping it in the opgroups only does the job halfway. Since you are already undertaking such a large refactoring, I would suggest to move the control logic (i.e., the new fpnew_aux module that you are introducing with this PR as well as all the handshake and tag signals) out of the opgroups completely and only keep the external register enable ports to control all the pipeline registers.

@Lynx005F
Copy link
Author

Lynx005F commented Jul 16, 2024

I think the upper limit to where we could put the fpnew_aux and fpnew_aux_fsm is the fpnew_opgroup_block since at this level we have the first round of RR-Arbitration. If you want to put it any higher you have to start with externally controlled RR-Arbiters for the data, which will definitely make the whole thing more complicated.

What could be doable is to put the fpnew_aux and fpnew_aux_fsm inside the format loop in the fpnew_opgroup_block.

  • One of the ugly things that happens then is that the fpnew_opgroup_block is no longer agnostic to which opgroup it is in, since not every opgroup uses the same aux module.

  • How many lanes are needed and which ones are active would now also need to be known in fpnew_opgroup_block to propperly feed fpnew_aux and fpnew_aux_fsm. So you would have to move this code up as well, and then create ways to pass it back down as parameters.
    I am not sure if you spotted this dependency, please check the in_lane_active signal and the LANE_FORMATS, ACTIVE_FORMATS etc. parameters that fpnew_aux and fpnew_aux_fsm depend on.

I think that could be feasible, but I can't really see why you think this is should go hand-in-hand with my previous changes. The aux modules primarily allow easy per-lane functionality and merge any non lane specific functionality, so I think they should be on the level of the lane for-loop.

@michael-platzer
Copy link
Contributor

I think the upper limit to where we could put the fpnew_aux and fpnew_aux_fsm is the fpnew_opgroup_block since at this level we have the first round of RR-Arbitration. If you want to put it any higher you have to start with externally controlled RR-Arbiters for the data, which will definitely make the whole thing more complicated.

Agree.

What could be doable is to put the fpnew_aux and fpnew_aux_fsm inside the format loop in the fpnew_opgroup_block.

That's where I would suggest to put it.

  • One of the ugly things that happens then is that the fpnew_opgroup_block is no longer agnostic to which opgroup it is in, since not every opgroup uses the same aux module.

As far as I have seen, the only difference between the fpnew_aux and the fpnew_aux_fsm modules is the FSM code in between the input and output pipeline stages. If lane_fsm_ready_i is tied to '1, then they should behave identically. So technically, all opgroups could use the fpnew_aux_fsm module and tie lane_fsm_ready_i to '1 if the FSM is not needed.

  • How many lanes are needed and which ones are active would now also need to be known in fpnew_opgroup_block to propperly feed fpnew_aux and fpnew_aux_fsm. So you would have to move this code up as well, and then create ways to pass it back down as parameters.

I do not think that this is strictly necessary either. The aux module could provide a single reg enable for each stage and that signal could then be distributed to each lane and individually gated within the opgroups. The lane count should not be required for the FSM either, since the first thing that happens to the lane_fsm_ready_i from the individual lanes within the FSM aux module is an AND reduction (that reduction could be moved into the opgroups and only the fsm_ready signal be fed to the aux module) and for the lane_fsm_start_o and lane_fsm_kill_o signals again a single signal could be issued by the aux module and then within each opgroup that signal is distributed to each lane and gated according to whether that lane is active.

I am not sure if you spotted this dependency, please check the in_lane_active signal and the LANE_FORMATS, ACTIVE_FORMATS etc. parameters that fpnew_aux and fpnew_aux_fsm depend on.

The lane active signal could stay within the opgroups (or even in the individual lanes/slices).

I think that could be feasible, but I can't really see why you think this is should go hand-in-hand with my previous changes. The aux modules primarily allow easy per-lane functionality and merge any non lane specific functionality, so I think they should be on the level of the lane for-loop.

Your aux module could remain completely agnostic to the lane count. There is no arbitration or similar control decisions that depend on the individual lanes and even your FSM code operates all lanes in sync. This is why I would suggest to raise the instantiation of this aux module and all the control logic within it to the highest level that does not involve arbitration decisions. I agree that the fpnew_opgroup_block is the upper limit and that is where I would suggest to move it.

@Lynx005F
Copy link
Author

Ah I see what you mean, I actually had some versions that did it kind of like that and then ended up going with this version instead. But I think there is probably way to make it more simple than what I have now.

The main difference in between aux and aux_fsm is the minimum latency. aux_fsm has one pipeline register that is parallel to the FSM itself to store the relevant data, meaning it always outpus after one cycle. I am pretty sure it is possible to create a version of aux_fsm that can also output immediately if needed, but then we have to change the definition of pipeline stages for the division, allocating the 1st pipeline stage to the base functionality of having an FSM (I think then you would also have to convert start_fsm into one bit inside the reg_enable signal).

@Lynx005F
Copy link
Author

I made some lane-agnostic versions of the aux chain modules for the pulp version to see how it would look like, I will port it over soon:
https://github.com/Lynx005F/cvfpu/tree/itemm/per_lane_reg_enable

I think this is the first step towards the direction that you want to go. The next step would now be to combine the aux and aux_fsm modules into one unified version before moving them up (in order to have opgroup-specific code on multiple levels).
If you have any good Ideas how to unify the two different behaviours above let me know, so far I have no good intuition for how to tackle this.

@Lynx005F
Copy link
Author

I made some of the requested changes, while I ignored others:

  • Aux modules are now lane-agnostic, lane stuff is handled in Slice.
  • There are still two aux modules as they have different delays and simply setting fsm_ready = 1 doesn't result in the same behavior. It would be possible to make one parameterized version that can do both, but I don't think the operation is logically identical.
  • Aux modules still on slice level as the instantiation is still dependent on which Opgroup they are in. It feels much better to me to have everything related to selecting the Opgroup in one file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants